
    Semantic Coherence Dataset: Speech transcripts

    The Semantic Coherence Dataset has been designed for experimenting with semantic coherence metrics. More specifically, the dataset was built to test whether probabilistic measures, such as perplexity, provide stable scores for analyzing spoken language. Perplexity, originally conceived as an information-theoretic measure for assessing the probabilistic inference properties of language models, has recently been shown to be an appropriate tool for categorizing speech transcripts based on semantic coherence accounts. In particular, perplexity has been successfully employed to discriminate between subjects suffering from Alzheimer's disease and healthy controls. The collected data consist of speech transcripts intended for investigating semantic coherence at different levels: data are arranged into two classes, to investigate intra-subject and inter-subject semantic coherence. In the former case, transcripts from a single speaker can be employed to train and test language models, exploring whether the perplexity metric provides stable scores in assessing talks from that speaker while still distinguishing between two different forms of speech, political rallies and interviews. In the latter case, models can be trained on transcripts from a given speaker and then used to measure how stable the perplexity metric is when computed with that speaker's model on transcripts from different speakers. Transcripts were extracted from talks lasting almost 13 hours (overall 12:45:17 and 120,326 tokens) for the former class, and almost 30 hours (29:47:34 and 252,270 tokens) for the latter. The data herein can be reused to perform analyses on measures built on top of language models and, more generally, on measures aimed at exploring the linguistic features of text documents.
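    As a minimal sketch of the protocol the abstract describes (assuming a simple add-one-smoothed bigram language model; the dataset's actual language models are not specified here, and all names below are illustrative), perplexity of a held-out transcript can be computed as follows:

    import math
    from collections import Counter

    def train_bigram_lm(tokens):
        """Train an add-one (Laplace) smoothed bigram model from a token list."""
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        vocab_size = len(unigrams)

        def prob(prev, word):
            # Smoothed conditional probability P(word | prev).
            return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

        return prob

    def perplexity(prob, tokens):
        """Perplexity = exp of the average negative log-probability per bigram."""
        log_sum = sum(math.log(prob(p, w)) for p, w in zip(tokens, tokens[1:]))
        return math.exp(-log_sum / (len(tokens) - 1))

    # Train on one speaker's transcripts, then score further transcripts from
    # the same speaker (intra-subject) or from another speaker (inter-subject):
    # stable scores on the former and higher perplexity on the latter would
    # support the use of the metric described in the abstract.
    lm = train_bigram_lm("the economy is growing and the economy will grow".split())
    print(perplexity(lm, "the economy is growing".split()))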

    GruPaTo at SemEval-2020 Task 12: Retraining mBERT on Social Media and Fine-tuned Offensive Language Models

    We introduce an approach to multilingual offensive language detection based on the mBERT transformer model. We download extra training data from Twitter in English, Danish, and Turkish, and use it to re-train the model. We then fine-tune the model on the provided training data and, in some configurations, implement a transfer learning approach that exploits the typological relatedness between English and Danish. Our systems obtained good results across the three languages (.9036 for EN, .7619 for DA, and .7789 for TR).
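    A hedged sketch of the fine-tuning step, using the Hugging Face transformers Trainer API as one plausible realization (the model checkpoint, hyperparameters, two-label scheme, and toy data below are assumptions, not the authors' reported configuration; the preceding Twitter re-training step is omitted):

    import torch
    from torch.utils.data import Dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    class OffensiveDataset(Dataset):
        """Wrap tokenized texts and binary labels (0 = not offensive, 1 = offensive)."""
        def __init__(self, texts, labels, tokenizer):
            self.enc = tokenizer(texts, truncation=True, padding=True, max_length=128)
            self.labels = labels

        def __len__(self):
            return len(self.labels)

        def __getitem__(self, i):
            item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
            item["labels"] = torch.tensor(self.labels[i])
            return item

    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-multilingual-cased", num_labels=2)

    # Toy examples standing in for the task's training data.
    train_ds = OffensiveDataset(["you are great", "you are awful"], [0, 1], tokenizer)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="out", num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=train_ds,
    )
    trainer.train()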